Synopsis

This milestone report is part of the capstone course of the Data Science Specialization on Coursera. The goal of the project is to build a predictive text model using data provided by SwiftKey. The data was obtained by scraping blog posts, news articles, and tweets from the internet.

Data Reading and Cleaning

The data comes in three separate files, one each for the blogs, news, and Twitter sources. I first analyse each dataset on its own, then combine them and analyse them together.
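As a rough illustration of this reading step, the sketch below loads each file as one document per line and reports basic size statistics. It is written in Python for illustration only; the file names in SOURCES and the helper names read_lines, summarize, and summarize_sources are hypothetical, since the report does not state the actual paths or code.

```python
from pathlib import Path


def read_lines(path):
    """Read a text file with one document per line, dropping empty lines."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return [line.strip() for line in handle if line.strip()]


def summarize(lines):
    """Return basic size statistics for a list of text lines."""
    word_counts = [len(line.split()) for line in lines]
    return {
        "lines": len(lines),
        "words": sum(word_counts),
        "longest_line_words": max(word_counts, default=0),
    }


# Hypothetical file names; the report does not state the actual paths.
SOURCES = {"blogs": "blogs.txt", "news": "news.txt", "twitter": "twitter.txt"}


def summarize_sources(sources=SOURCES):
    """Summarize every source file that exists on disk, skipping missing ones."""
    return {
        name: summarize(read_lines(path))
        for name, path in sources.items()
        if Path(path).exists()
    }
```

Calling summarize_sources() with the real paths would give one small dictionary per source, which is enough to compare the sizes of the three files before any cleaning.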

Exploratory Analysis

I start by exploring which words are most common in each dataset. Before counting, I remove stop words: very common words such as "the", "is", and "and" that carry little meaning on their own.
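The stop-word filtering and counting described here could look roughly like the following Python sketch. The STOP_WORDS set is a tiny illustrative list and the tokenizer is a simple assumption; neither is taken from the report, and a real analysis would use a fuller stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the report).
STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "it", "on"}


def tokenize(text):
    """Lowercase the text and split it into words of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(lines, n=10, stop_words=STOP_WORDS):
    """Count the words in all lines, skipping stop words, and return the n most common."""
    counts = Counter(
        word
        for line in lines
        for word in tokenize(line)
        if word not in stop_words
    )
    return counts.most_common(n)
```

Running top_words separately on the blogs, news, and Twitter lines gives the per-source rankings that the word count graphs are built from.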

Wordcount Graphs

The sources share many of their most common words, but there are also differences. The Twitter data contains more abbreviations such as "lol", while the news data contains words such as "police" and other official-sounding vocabulary. The blogs data appears to have the most everyday vocabulary of the three.

Word Clouds

Word clouds give additional visualizations of the datasets and a more intuitive sense of their vocabulary. I have added sentiment analysis to the plots, so we can also see which positive and negative words appear in each dataset.
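One way to prepare the data behind such a sentiment-split word cloud is to tag each word against a sentiment lexicon and count the positive and negative words separately. The Python sketch below assumes a very small hand-written lexicon (POSITIVE and NEGATIVE are illustrative, not the lexicon used for the plots) and leaves the drawing itself to whatever plotting tool is used.

```python
from collections import Counter

# Illustrative mini-lexicon (an assumption); real lexicons contain thousands of words.
POSITIVE = {"love", "good", "great", "happy"}
NEGATIVE = {"bad", "hate", "sad", "wrong"}


def sentiment_counts(words):
    """Count how often each positive and each negative word occurs.

    Words that are in neither lexicon are ignored.
    """
    counts = {"positive": Counter(), "negative": Counter()}
    for word in words:
        word = word.lower()
        if word in POSITIVE:
            counts["positive"][word] += 1
        elif word in NEGATIVE:
            counts["negative"][word] += 1
    return counts
```

The two counters map directly onto the two halves of the cloud: each word's count sets its size, and its group sets its colour.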

Blogs

News

Twitter

Correlation Plots

Blending the Datasets

Exploratory Plots

Word Counts

Graphs